KVM local migration issue #3521 #3533

Merged
yadvr merged 2 commits into apache:master from CLDIN:kvm-local-migration-issue-3521 on Aug 7, 2019

Conversation

@GabrielBrascher
Member

Description

Fix a regression bug that affects KVM local storage migration. Some of the execution flows for KVM local storage migration had been altered so that only managed storage could execute them. This change fixes them so that both managed and non-managed storage can execute the migration flow.

Fixes #3521
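
Roughly, the change amounts to relaxing a guard like the one below; this is a minimal sketch with hypothetical names, not the actual diff.

// Minimal sketch with hypothetical names (not the actual diff): before the fix,
// the KVM storage-migration flow was short-circuited for non-managed pools.
private boolean canUseStorageMigrationFlow(StoragePoolVO srcPool, StoragePoolVO destPool) {
    // Regressed behaviour (roughly): only managed storage was allowed through.
    // if (!srcPool.isManaged() || !destPool.isManaged()) {
    //     return false;
    // }

    // Fixed behaviour (roughly): both managed and non-managed primary storage
    // may execute the KVM (local) storage migration flow.
    return srcPool != null && destPool != null;
}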

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

  1. Start a VM on a KVM host with local storage.
  2. Migrate the VM to another KVM host that has local storage.
  3. Assert that (i) the ACS volumes table, (ii) the VM volume content, and (iii) the VM libvirt XML have all been updated correctly.
  4. Keep migrating the VM across hosts by repeating steps 2 and 3, including migrating the VM back to its first host.

GabrielBrascher added 2 commits July 25, 2019 20:05
Local storage KVM live migration does not support
VIR_MIGRATE_NON_SHARED_INC. Among other fixes that address apache#3521, this
commit allows the flag VIR_MIGRATE_NON_SHARED_DISK to be used when migrating
VMs on local storage.
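
For context, the flag selection the commit message describes can be sketched as follows; the numeric values are libvirt's virDomainMigrateFlags constants, while the class and method names here are illustrative, not the actual CloudStack code.

// Illustrative sketch only: flag values match libvirt's virDomainMigrateFlags;
// the surrounding names are hypothetical.
public class MigrationFlagsSketch {
    static final long VIR_MIGRATE_LIVE            = 1L << 0;
    static final long VIR_MIGRATE_NON_SHARED_DISK = 1L << 6; // full copy of non-shared disks
    static final long VIR_MIGRATE_NON_SHARED_INC  = 1L << 7; // incremental copy, not supported for local storage

    static long migrationFlags(boolean migrateStorage, boolean localStorage) {
        long flags = VIR_MIGRATE_LIVE;
        if (migrateStorage) {
            // Local storage cannot use the incremental flag, so fall back to
            // copying the whole non-shared disk during the live migration.
            flags |= localStorage ? VIR_MIGRATE_NON_SHARED_DISK : VIR_MIGRATE_NON_SHARED_INC;
        }
        return flags; // passed on to the libvirt domain migrate call
    }
}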
@yadvr
Member

yadvr commented Jul 31, 2019

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@yadvr yadvr requested a review from nvazquez July 31, 2019 08:07
@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-202

@yadvr yadvr requested review from andrijapanicsb, borisstoyanov and nvazquez and removed request for nvazquez July 31, 2019 10:09
@yadvr
Member

yadvr commented Jul 31, 2019

Live storage migration for VMware and KVM, for VMs with shared storage, would need to be tested. cc @andrijapanicsb @borisstoyanov @nvazquez
@blueorangutan test matrix

@blueorangutan

@rhtyd a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-249)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33429 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3533-t249-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_iso.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-248)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34904 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3533-t248-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_list_ids_parameter.py
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_list_snapshots Error 0.05 test_list_ids_parameter.py
test_01_scale_vm Failure 32.01 test_scale_vm.py

@blueorangutan

Trillian test result (tid-250)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49577 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3533-t250-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vgpu_enabled_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_list_snapshots_with_removed_data_store Error 1.22 test_snapshots.py
test_01_volume_usage Error 10.47 test_usage.py

@yadvr
Member

yadvr commented Aug 2, 2019

Tests LGTM. Can you review, @andrijapanicsb @borisstoyanov? We'll need to test live VM-with-storage migration on shared/NFS storage on KVM.


StoragePoolVO destStoragePool = _storagePoolDao.findById(destDataStore.getId());
StoragePoolVO sourceStoragePool = _storagePoolDao.findById(srcVolumeInfo.getPoolId());
if (sourceStoragePool.getPoolType() == StoragePoolType.NetworkFilesystem && destStoragePool.getPoolType() == StoragePoolType.NetworkFilesystem) {
Member

@GabrielBrascher shouldn't it loop and check all volumes before returning true? For example, it could return false if any of the src/dest pools is not NFS.

Member Author

@rhtyd The idea is that if at least one of the disks is on NFS, it cannot be migrated via the local-storage execution flow. Taking a second look, I see that the method name might need improvement.

Am I missing something? Please let me know if something doesn't look right. Thanks for the review.
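
For illustration, a loop-based variant along the lines suggested above might look like this; the method name is hypothetical and it assumes the same DAO, fields, and types as the snippet in the diff.

// Hypothetical sketch, not the PR's code: iterate over every volume being
// migrated and treat the operation as an NFS case (i.e. not the local-storage
// flow) as soon as any source or destination pool is NFS.
private boolean isAnyVolumeOnNfs(Map<VolumeInfo, DataStore> volumeToPool) {
    for (Map.Entry<VolumeInfo, DataStore> entry : volumeToPool.entrySet()) {
        StoragePoolVO sourceStoragePool = _storagePoolDao.findById(entry.getKey().getPoolId());
        StoragePoolVO destStoragePool = _storagePoolDao.findById(entry.getValue().getId());
        if (sourceStoragePool.getPoolType() == StoragePoolType.NetworkFilesystem
                || destStoragePool.getPoolType() == StoragePoolType.NetworkFilesystem) {
            return true;
        }
    }
    return false;
}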

yadvr
yadvr previously requested changes Aug 2, 2019
Member

@yadvr yadvr left a comment

One remark left, otherwise LGTM

@andrijapanicsb
Contributor

andrijapanicsb commented Aug 6, 2019

CentOS 6, VM + ROOT on local storage, fails in CloudStack with the following lines on the destination KVM host:

2019-08-06 12:08:05,608 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:a900bfb2) Trying to fetch storage pool 7d69d5bf-0383-476a-90bd-41139db0d596 from libvirt
2019-08-06 12:08:05,610 WARN [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:a900bfb2) Caught:
com.cloud.utils.exception.CloudRuntimeException: Could not fetch storage pool 7d69d5bf-0383-476a-90bd-41139db0d596 from libvirt
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.getStoragePool(KVMStoragePoolManager.java:256)
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.getStoragePool(KVMStoragePoolManager.java:242)

The storage pool 7d69d5bf-0383-476a-90bd-41139db0d596 is the local pool on SOURCE host.

@GabrielBrascher not sure if you want this fixed on CentOS 6, since we officially support it?

Ubuntu 18, VM+ROOT on local storage - migration to another host works fine.

CentOS7 doesn't work (thanks to RedHat $$$hitty commercial logic and qemu code changes...)

Will test shared storage volume migration...

@andrijapanicsb
Contributor

andrijapanicsb commented Aug 6, 2019

@mike-tutkowski any chance to test live storage migration with KVM + SolidFire? I see a lot of removed/changed code around "xxx_managed_yyy".

@GabrielBrascher can you advise on the above ^^^?

@andrijapanicsb
Contributor

@syed ^^^

@GabrielBrascher
Member Author

Thanks for testing @andrijapanicsb. I did all tests on KVM + Ubuntu and it worked well; however, I did not test with CentOS 6. Are there any logs to share so that we can tackle the CentOS 6 issue?

@andrijapanicsb
Contributor

@svenvogel can you kindly test whether KVM live migration with SF works fine (NFS/CEPH to SF)?

@andrijapanicsb
Contributor

@GabrielBrascher
here they are:
mgmt logs: https://pastebin.com/L97FgFut
agent logs: https://pastebin.com/D6HnZSnJ

Contributor

@andrijapanicsb andrijapanicsb left a comment

LGTM, except for the 2 comments previously left @GabrielBrascher @rhtyd

  1. CentOS 6 support for local storage migration - yes or no?
  2. It would be good if someone could test CEPH/NFS to SF with KVM, if possible (won't work on CentOS 7 unless a very custom qemu/libvirt is used)

=====================================
For the record, what has been tested for regression

Baseline was 4.11.3

4.11 KVM:
NFS to NFS - online storage migration (migrateVirtualMachineWithVolume) doesn't work at all (1 or 2 volumes attached to VM). This might have been fixed in 4.12
NFS to NFS - offline migration of VM works only when single ROOT disk attached (fails if DATA disks attached), otherwise direct volume offline migration works fine.

4.11 VMware -
NFS to NFS - online storage migration (migrateVirtualMachineWithVolumes) - works fine for both single disk VM and VM with DATA disks as well
NFS to NFS - offline migration of VM works only when single ROOT disk attached (fails if DATA disks attached), otherwise direct volume offline migration works fine.

==============================
4.13 PR3533 - KVM:
NFS to NFS - online storage migration (migrateVirtualMachineWithVolumes) with just the ROOT volume attached works (fails if DATA disks are attached, even if choosing to migrate just the ROOT volume)
NFS to NFS - offline migration of VM works the same as in 4.11 (can migrate a VM with only the ROOT volume; otherwise, direct volume offline migration works fine)

4.13 PR3533 VMware
NFS to NFS - online storage migration (migrateVirtualMachineWithVolumes) - works fine for both single disk VM and VM with DATA disks as well
NFS to NFS - offline migration of VM - an improvement over 4.11 - can migrate the whole VM with both ROOT and DATA volumes attached

Contributor

@borisstoyanov borisstoyanov left a comment

LGTM based on test results and Andrija's testing

@yadvr yadvr dismissed their stale review August 7, 2019 10:10

LGTM

@yadvr
Member

yadvr commented Aug 7, 2019

I think for EL6, storage migration (live) would be limiting. The feature should largely be encouraged on newer distros with newer libvirt.

@yadvr yadvr merged commit 5dc982d into apache:master Aug 7, 2019
@andrijapanicsb
Contributor

@GabrielBrascher need your advice since you guys worked on the local migration with KVM.

I saw a case (I didn't reproduce it myself, but saw it while working with a customer on 4.13) where, when KVM is using local storage and you try to put a host into maintenance, the following seems to happen:

  • User VMs - not sure whether they are migrated away (can't say for sure), but I assume they are.
  • CPVM and SSVM are NOT live migrated (with storage), fails as before
  • VRs are live migrated with their storage/volume to a new host - BUT the volume format in the DB seems to be set to RAW (again, this is based on other users' data from the cloud.volumes table, not my own testing)

My question is - what is expected to happen and what is not, and why is the VR's volume in RAW rather than QCOW2 format? Can you lay out the expected behaviour based on the code you've done so far (in 4.13)?

Thanks

@GabrielBrascher
Member Author

@andrijapanicsb I will run some tests and check this out, thanks for the heads-up.



Development

Successfully merging this pull request may close these issues.

Regression bug in 4.13.0.0-SNAPSHOT that affects KVM local storage migration

5 participants